Comments on “Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpoint”

Authors

  • Guillaume Aupy
  • Yves Robert
  • Frédéric Vivien
  • Dounia Zaidouni
Abstract

In this short note, we comment on the recent paper “Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing” by Bouguerra et al., published in [3]. We start by identifying some errors in their equations. We then explain that, contrary to the authors' statements, they do not actually use the distribution of lead times. Finally, we show that their algorithm does not change policy at the best possible moment, and we point to our own work [2] for the (correct version of the) optimal algorithm.

Key-words: fault tolerance, checkpointing, prediction, algorithms, model, exascale

Affiliations: LIP, École Normale Supérieure de Lyon, France; University of Tennessee Knoxville, USA; Institut Universitaire de France; INRIA


Similar resources

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...


A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

As high-performance computing (HPC) systems continue to increase in scale, their mean-time to interrupt decreases respectively. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory and not proportionally increasing I/O capabilities, it is becoming less efficient. Proactive FT avoids experiencing failures ...


Improving the palbimm scheduling algorithm for fault tolerance in cloud computing

Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...


Improving Mobile Grid Performance Using Fuzzy Job Replica Count Determiner

Grid computing is a term referring to the combination of computer resources from multiple administrative domains to reach a common computational platform. Mobile Computing is a Generic word that introduces using of movable, handheld devices with wireless communication, for processing data. Mobile Computing focused on providing access to data, information, services and communications anywhere an...



Publication date: 2013